Parallel Texts Extraction from Multimodal Comparable Corpora
Authors
Abstract
Statistical machine translation (SMT) systems depend on the availability of domain-specific bilingual parallel text. However, parallel corpora are a limited resource and are often unavailable for some domains or language pairs. We analyze the feasibility of extracting parallel sentences from multimodal comparable corpora. This work extends the use of comparable corpora by using audio sources instead of texts on the source side. The audio is transcribed by an automatic speech recognition system and translated with a baseline SMT system. We then use information retrieval in a large text corpus of the target language to extract parallel sentences. We have performed a series of experiments on data from the IWSLT’11 speech translation task that show the feasibility of our approach.
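The retrieval step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which retrieval model or filtering criterion is used, so TF-IDF cosine similarity is swapped in here as a stand-in; the function names (`build_index`, `extract_pairs`) and the `threshold` value are hypothetical, and the ASR transcripts and SMT translations are assumed to be given as plain strings.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a real system would use a proper one.
    return text.lower().split()


def build_index(corpus):
    # Build TF-IDF vectors for every sentence of the target-language corpus.
    docs = [Counter(tokenize(s)) for s in corpus]
    df = Counter()
    for d in docs:
        df.update(d.keys())
    n = len(corpus)
    idf = {t: math.log((n + 1) / (c + 1)) + 1.0 for t, c in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in d.items()} for d in docs]
    return idf, vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def extract_pairs(transcripts, translations, corpus, threshold=0.5):
    # transcripts: ASR output in the source language (assumed given).
    # translations: baseline SMT output for each transcript (assumed given).
    # Each translation is used as a query against the target corpus; the
    # best match is paired with the transcript if it clears the
    # (hypothetical) similarity threshold.
    idf, vectors = build_index(corpus)
    pairs = []
    for src, hyp in zip(transcripts, translations):
        counts = Counter(tokenize(hyp))
        # Query terms unseen in the corpus are dropped.
        query = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
        scores = [cosine(query, v) for v in vectors]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] >= threshold:
            pairs.append((src, corpus[best], scores[best]))
    return pairs
```

Each returned tuple holds the source-side transcript, the retrieved target-language sentence, and the similarity score. Raising `threshold` trades recall for precision in the extracted pairs.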
Similar resources
Extracting a parallel corpus from comparable documents to improve translation quality in machine translation systems
Data for training statistical machine translation systems are usually drawn from three kinds of resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are the ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
Multimodal Comparable Corpora as Resources for Extracting Parallel Data: Parallel Phrases Extraction
Discovering parallel data in comparable corpora is a promising approach for overcoming the lack of parallel texts in statistical machine translation and other NLP applications. In this paper we propose an alternative to comparable corpora of texts as resources for extracting parallel data: a multimodal comparable corpus of audio and texts. We present a novel method to detect parallel phrases fr...
Better handling of a bilingual collection of texts
Statistical machine translation models are trained from parallel corpora, which are collections of translated texts. These texts are usually processed using dedicated tools called “sentence aligners”, which output parallel sentence pairs. However, parallel resources are very scarce in certain languages or domains. Alternative solutions have been proposed that extract parallel sentences from the...
Survey on Comparable Corpora until June 2012
Here we present a survey of important work on comparable corpora between 1995 and 2012. Unlike parallel corpora, which are clearly defined as translated texts, there is a wide variation of non-parallelism in comparable text. Non-parallelism is manifested in terms of differences in author, domain, topic, time period, and language. The most common text corpora have non-parallelism in ...
Development and Application of a Cross-language Document Comparability Metric
In this paper we present a metric that measures the comparability of documents across different languages. The metric is developed within the FP7 ICT ACCURAT project as a tool for aligning comparable corpora at the document level; these aligned comparable documents are then used for phrase alignment and extraction of translation equivalents, with the aim to extend phrase tables of statistical M...
Publication date: 2012